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ABSTRACT 

This  paper  describes  a  further  extension  of  the  combining  fetch-and-add  used 
in  shared  memory  multiprocessors  such  as  the  NYU  Ultracomputer  and  the 
IBM  RPn.  This  extension  permits  the  elimination  of  busy-waiting  in  a  number 
of  important  parallel  operations.  These  operations  include  barrier  synchroniza- 
tion, stack  and  queue  operations,  full/empty  bits,  and  group  lock;  most  of 
these  operations  become  one  or  two  single-instruction  operations  for  the  PE. 
The  hardware  necessary  is  compatible  with  the  design  of  the  switches  which 
are  used  in  the  Ultracomputer,  and  in  fact  are  implemented  as  fetch-and-add 
instructions  with  an  increment  of  unity. 

1.    Summary 

In  this  paper  we  describe  a  relatively  minor  architectural  addition  to  the  combining 
fetch-and-add  instruction  which  has  been  used  on  a  number  of  shared  memory  machines.  The 
main  idea  is  to  provide  a  (micro-)  programmable  interface  between  the  network  and  the 
memory.  This  programmable  interface  (or  memory  controller),  which  we  refer  to  as  the  add- 
and-lambda  processor  (or  A&Lp)  would  receive  from  the  network  an  address  and  an  incre- 
ment, and  would  be  responsible  for  modifying  the  memory  and  returning  the  appropriate  value 
to  the  network.  The  program  executed  by  the  A&Lp  would  normally  be  regarded  as  part  of 
the  system,  and  not  normally  accessible  to  users. 

We  show  how  this  permits  more  efficient  implementation  of  a  number  of  parallel  pro- 
gramming primitives,  in  some  cases  without  busy-waiting.  We  give  implementations  of  the  fol- 
lowing form: 


barrier  synchronization  (few  PEs) 
queue  insertion 

queue  extraction 

stack  insertion 
stack  extraction 
read-when-full  ( 1  reader) 


F&A(addr,l); 

qfF&A(addrl,l)  mod  n]  :=  item; 
F&A(addr2,l); 

item  :=  q[F&A(addrl,l)  mod  n); 
F&A(addr2,l); 


s[F&A(addrl,l)l  :=  item; 
F&A(addr2,l); 

item  :=  s[F&A(addrl,l)l; 
F&A(addr2,l); 

V  :=  F&A(addrl,0); 


write-when-empty  ( 1  writer) 

F&A(addr2,v); 

None  of  these  involve  busy  waiting.  In  addition,  we  show  how  code  for  the  following  opera- 
tions: 

lock 

barrier  synchronization  (many  PEs) 

read-when-full  /  steal  /  lock  (multiple  readers) 

can  be  implemented  with  code  of  the  form: 

if  F&A(addrI.l)  <  >  0  then 

while  F&A(addr2,l)  <  >  0  loop  null  endloop; 

with  less  iterations  of  the  busy-wait  loop. 

1.1.    A  Simple  Example  -  Barrier  Synchronization 

Before  going  into  details,  we  give  an  informal  (and  somewhat  simplified)  description  of 
how  the  extensions  we  propose  might  work.  We  consider  the  barrier  synchronization  operation 
for  a  small  number  of  PEs  on  a  shared-memory  multiprocessor  with  a  combinable  F&A  opera- 
tion. For  this  operation,  each  of  a  number  n  of  processes  must  wait  at  a  synchronization 
point  in  the  program  until  the  other  n-1  reach  such  a  point.  Each  PE  does  this  by  executing 
F&A(a,l); 

The  effect  of  this  instruction  is  to  send  a  fetch-and-add  request  to  memory  location  a.     If  a 
number  (say  m)  of  PEs  execute  this  instruction  at  the  same  time,  the  combining  fetch-and-add 
network  will  deliver  to  the  memory  a  request  which  looks  like: 
F&A(a.m) 

With  the  add-and-lambda  approach,  this  request  is  processed  by  the  add-and-lambda  proces- 
sor (A&Lp)  in  the  following  way.  The  A&Lp  looks  at  location  a  to  sec  what  kind  of  an  object 
it  is.  It  finds  that  it  is  tagged  as  a  barrier-synchronization  object,  with  a  count  of  n.  Ifm  is 
equal  to  n.  it  returns  a  value  of  zero  through  the  network,  so  the  n  requesting  PEs  get  distinct 
values  for  i  between  0  and  n-I.  Since  there  were  n  requests,  the  PEs  are  synchronized,  and 
can  proceed. 

If  m  is  not  equal  to  n.  which  is  usually  the  case,  the  A&I,p  does  not  respond  to  the 
request  immediately,  but  puts  the  request  in  a  queue  of  unanswered  requests,  and  accumulates 
the  number  of  such  requests  in  a  field  (say  c)  in  a.  When  the  value  of  c  reaches  n,  the  A&Lp 
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responds  to  all  the  requests  in  the  queue.  In  this  case  the  values  returned  are  not  significant. 

The  result  is  a  single-instruction  barrier  synchronization  operation  with  no  busy  waiting. 
(Note  that  this  solution  is  good  for  a  small  number  of  PEs,  but  not  for  a  large  number,  since 
the  responses  to  the  requests  will  actually  be  issued  serially  by  the  A&Lp  --  we  discuss  another 
solution  for  this  problem  below). 

2.  Introduction 

The  combining  fetch-and-add  (F&A)  operation  has  been  shown  |GGKMRS83,  GLR83. 
D85]  to  be  capable  of  implementing  a  number  of  important  parallel  operations  in  a  shared- 
memory  multiprocessor.  In  simple  operations,  such  as  assigning  the  elements  of  a  vector  to 
different  processors  (PEs),  where  the  assignments  made  by  the  combining  network  arc  correct, 
F&A  gives  good  results.  However,  in  more  complicated  cases,  the  solutions  are  sometimes 
inefTicient  and  complex.  This  can  be  explained  to  some  extent  by  tiic  observation  that  the 
combining  network  does  not  have  enough  state  information  to  make  correct  decisions,  which 
may  subsequently  have  to  be  corrected  by  the  processors,  requiring  further  coordination. 
Even  in  the  relatively  simple  case  of  parallel  queue  operations,  five  accesses  to  the  global 
memory  arc  needed,  and  the  algorithm  contains  highly  non-intuitive  code  to  avoid  a  tricky 
race  condition. 

In  [1186]  it  was  shown  that  many  of  these  operations  could  be  made  considerably  more 
efficient  by  moving  some  of  the  critical  decision-making  to  a  more  logical  place,  namely  to  the 
network-memory  interface.  The  operations  required  were  more  general  than  simply  increment- 
ing the  value  in  the  memory,  and  were  called  operations,  or  more  simply  A&L.  In  this  paper 
we  present  an  alternative  formulation  of  these  operations,  which  we  will  refer  to  as  A&L2. 
The  main  advantage  of  this  formulation  is  that  it  permits  the  possibility  of  eliminating  busy- 
waits  in  some  cases. 

3.  Add-and-Lambda 

In  the  following  sections,  we  describe  three  ways  in  which  tlic  A&Lp  might  he  used. 

3.1.    Version  I 

In  A&Ll,  described  in  more  detail  in  |II86],  we  considered  a  memory  location  to  consist 
of  a  number  of  fields,  and  operations  on  these  locations  as  F&As  with  the  increment  being  a 
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value  of  1  in  one  or  more  fields.  These  instructions  will  be  combined  by  the  network  in  the 
usual  way,  ending  up  at  the  memory  location  in  the  form  of  a  positive  increment.  If  the  fields 
are  large  enough,  this  increment  specifies  unambiguously  how  many  increments  have  been 
requested  for  each  field.  If  the  number  passed  back  through  the  network  is  positive,  the  result 
is  cflectively  a  number  of  independent  F&As.  Furthermore,  values  passed  back  through  the 
network  in  fields  not  used  by  the  F&A  operations  will  not  be  changed  by  the  network,  permit- 
ting the  communication  of  state  information  back  to  the  PEs. 

Significantly  more  power  can  be  provided  by  even  a  modest  processor  provided  at  the 
network-memory  interface.  This  permits  the  operation  on  the  memory  location  to  be  more 
complex  than  just  a  simple  increment,  and  the  value  returned  through  the  network  to  carry 
significantly  more  information  back  to  the  PEs.  In  [1186]  examples  are  given  of  some  primi- 
tives which  can  be  reduced  by  a  factor  of  10  in  complexity  using  this  method,  both  in  the 
number  of  network  operations,  and  in  the  number  of  PE  instructions. 

The  work  in  [1186]  suggested  that  an  A&I,  microcoded  processor  with  less  than  2,56 
microwords  and  several  relatively  narrow  adders  would  be  adequate  for  many  applications. 

3.2.    Version  2 

Each  A&L2  object  resides  in  a  single  memory  module,  though  it  may  occupy  a  number  of 
memory  locations.  (Note  that  global  memory  is  usually  mapped  so  that  successive  addresses 
are  in  different  memory  modules,  to  reduce  contention  when  PEs  are  processing  linear  struc- 
tures. This  would  then  require  that  the  addresses  used  to  store  an  A&L2  object  not  be  adja- 
cent, which  could  cause  some  inefficiency  in  storage  management.  In  some  machines,  such  as 
the  IBM  RP3,  both  scattered  and  sequential  mapping  schemes  are  available;  however,  it  would 
be  necessary  for  sequentially  mapped  addresses  to  be  globally  accessible). 

Each  operation  on  an  A&L2  object  is  a  F&A  operation  with  three  parameters,  the 
address  of  the  object,  an  opcode  and  an  increment.  F&A  operations  will  be  combined  in  the 
usual  way  by  the  PE-memory  network,  but  only  operations  on  the  same  object  with  the  same 
opcode  can  be  combined. 

The  function  of  the  A&Lp  can  thus  be  thought  of  as  a  procedure  which  receives  as  its 
arguments  the  base  address  b.  the  opcode  op,  the  increment  from  the  network,  and  generates 
the  value  to  be  returned  through  the  network.    The  A&Lp  may  also  access  and  change  the 
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contents  of  the  memory  module.  Its  overall  action  is  as  Toliows: 

procedure  A&L  (b:  in  address;  op:  in  opcode; 

inc:  in  nwword;  rv:  out  nwword)  is 
begin 

rv  :=  lambda  l(b,op,inc); 

b  :=  lambda2(b,op,inc); 
end  A&L; 

where  lambda!  and  lambda2  are  the  functions  we  will  be  discussing. 

On  the  Ultracomputer  design,  the  standard  combining  network  will  only  combine  identi- 
cal addresses,  so  we  can  get  the  effect  we  want  by  using  spare  address  bits  to  specify  the 
opcode.  In  this  paper  we  will  write  such  an  operation  in  the  form 
F&A(ObjectAddress  +  opcode,increment).  All  the  examples  we  describe  in  this  paper  can  be 
implemented  by  16-byte  objects,  so  there  are  4  bits  potentially  useable  for  opcodes  (though 
none  of  our  examples  has  more  than  four  opcodes). 

The  base  word  of  each  A&L  object  will  be  at  the  address  obtained  by  masking  out  the 
opcode,  and  will  be  used  to  store  the  state  of  the  object,  whicli  can  be  initialized  by  standard 
write  operations.  {Larger  A&L  objects  could  also  be  accomodated  by  ensuring  that  the  base 
word  contains  state  information  which  identifies  the  real  base  address  of  the  object.  Smaller 
objects  could  be  represented  by  using  the  base  word  to  specify  how  much,  if  any.  of  the 
address  should  be  regarded  as  opcode). 

3.3.    Eliminating  Busy-waiting 

Many  A&L-based  algorithms  require  a  PE  to  busy-wait  in  some  conditions.  This  is 
clearly  undesirable  for  two  reasons:  it  wastes  PE  time,  and  it  increases  network  traffic.  We 
describe  here  how  much  of  this  busy-waiting  can  be  eliminated. 

In  many  of  the  algorithms  we  have  looked  at,  the  A&L  processor  is  aware  of  which 
responses  will  require  the  PE  to  busy-wait.  This  busy-wait  would  not  have  occurred  if 
requests  (from  other  PEs)  had  arrrived  previously.  What  we  propose,  therefore,  is  that  the 
A&Lp  be  given  the  ability  to  postpone  dealing  with  a  request,  by  putting  it  on  a  queue  of 
requests  for  the  specified  location.  This  queue  can  be  stored  in  the  local  memor\'  of  the  A&Lp. 
using  standard  linked  list  techniques.  Each  entry  will  contain  the  number  of  the  requesting 
processor,  and  usually  some  indication  of  the  appropriate  response.  Enough  memory  will  be 
required  in  the  worst  case  for  one  entry  per  processor,  which  will  in  general  be  negligible 
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compared  with  the  size  of  memory  handled  by  the  A&Lp. 

We  discuss  below  the  efTect  of  the  enqueing  operation  on  the  lengths  of  queues  within  the 
switches  themselves. 

4.    An  Example  -  the  Two-way  Lock 

We  illustrate  the  technique  by  giving  as  an  example  a  primitive  we  call  a  two-way  lock, 
called  a  readers-readers  lock  in  [],  and  which  is  a  variant  of  the  group  lock  proposed  by  Dimi- 
trovsky  [D86]. 

4.1.  Definition 

The  two-way  lock  provides  four  operations  on  an  object  b; 

lockl  (b); 
unlockl  (b); 
lock2  (b); 
unlock2  (b); 

with  the  property  that  if  a  number  of  processes  execute: 

lockl  (b);  codel;  unlockl  (b); 
or 

lock2  (b);  code2;  unlock2  (b); 

no  instance  of  codel  will  execute  at  the  same  time  as  an  instance  of  code2.  and  there  is  no 
starvation.  The  two-way  lock  is  thus  similar  to  a  readers-writers  lock,  but  without  writer 
preference,  and  without  writer  mutual  exclusion. 

4.2.  Applications 

The  applications  of  two-way  locks  include  all  the  applications  of  group  locks  described  in 
[D86],  including: 

-  parallel  queues,  with  codel  being  insert,  and  code2  being  extract:  it  is  straightforward 
to  implement  parallelizable  versions  of  insert  using  F&A:  the  two-way  lock  keeps  them 
separate  from  the  extracts; 

-  parallel  stacks,  for  the  same  reason; 

-  readers-writers,  with  no  starvation:  codel  is  read,  and  codc2  is  write;  serialization  of 
the  writers  can  be  done  by  implementing  lock2  to  return  the  position  of  the  process 
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within  the  subgroup;  furthermore,  redundant  writes  can  he  eliminated  if  only  the  proces- 
sor who  gets  a  1  actually  writes  (see  Appendix  1  for  details). 

5.    Two-way  Lock  without  Busy-waiting 

For  this  lock  we  will  need  four  operations,  and  the  object  will  he  in  one  of  four  states: 

11:  lockl  accepting 
ul:  unlock!  accepting 
12:  lock2  accepting 
u2:  unlock2  accepting 

and  if  we  refer  to  processes  issuing  locki  requests  as  i-processes,  will  have  the  following  struc- 
ture: 

type  two_way_lock  is 
record 

state:  (Il.ul.l2,u2)  :=  11; 

current:  integer  :  =  0;  —  #  i-processes  currently  accepted 

finished:  integer  :=  0;  —  #  i-processes  which  have  unlocked 

other:  integer  :=  0;  —#  non-i-processes  waiting 

nextg:  integer  :  =  0;  --  #  i-processes  waiting 

otherq:  queue  :  =  null;  —  queue  of  delayed  i-processes 

nextgq:  queue  :=  null;  —  queue  of  delayed  non-i-processes 

end; 

The  algorithm  we  use  is  as  follows:  lockl  requests  are  accepted  when  in  state  11.  returning 
the  position  in  the  current  1 -group  to  the  requesting  PE.  A  lock2  request  when  in  state  11 
closes  the  1 -group,  changing  the  state  to  ul;  the  response  to  this  request  is  delayed  until  the 
current  1 -group  is  finished.  When  in  state  ul,  lockl  and  lock2  requests  are  accumulated,  but 
the  responses  are  delayed  until  the  appropriate  groups  are  started,  when  the  requesting  PEs 
receive  their  position  in  the  group.  Also,  in  state  ul,  unlock  1  requests  are  counted  till  they 
equal  the  number  of  accepted  lockl  requests,  at  which  time  the  state  changes  to  12,  the  queued 
lock2  requests  can  proceed,  and  the  rejected  lockl  requests  become  next  in  line.  The  algo- 
rithm is  as  follows  (we  use  the  notation  a  +  =  b  for  a  :=  a  +  b): 

case  op  of 

lockl:  service_lockl; 

unlock  1:  service_unlockl; 

lock2:  service_lock2;  -  analogous  to  lockl 

unlock2:  service_unlock2;  --  analogous  to  unlock  1 

end  case; 

where  servicejockl  is  the  following: 
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case  state  of 

11:  rv  :=  current;   current  +  =  inc;   reply(rv); 
ul:rv:=  nextg;   nextg  +=  inc;   enqueue(rv,nextgq); 
12:  if  current  =  finished  then 

current  :=  inc;    finished  :=  0; 
state  :=  11;    reply(O); 
else 

rv  :  =  other;    other  +  =  inc; 
state  :  =  u2;   enqueue(rv,otherq); 
endif; 
u2:  rv  :  =  other;    other  +  =  inc; 
enqueue(rv,otherq); 
end  case; 

and  service_unlockl  is: 

case  state  of 

ll:rv:=  finished;   finished  +=  inc;    reply(rv); 
ul:  rv  :=  finished;   finished  +  =  inc;    reply(rv); 
if  current  =  finished  then 

current  :=  other;    finished  :=  0;    rcplyall(otherq); 
other  :  =  nextg;    otherq  :  =  nextgq; 
nextg  :=  0;   nextgq  :=  null; 
if  other  =  0  then     {12} 

state  :=  12; 
else  {u2} 

state  :=  u2; 
endif; 
endif; 
12.  u2: 
error; 
end  case; 

The  appropriate  action  for  erroneous  requests  (unlocks  without  locks)  will  in  general  depend 
on  the  application,  but  could  be  to  return  a  non-zero  reject  field. 

The  code  for  the  lock2  operation  is  just  F&A  (b  +  lock2.  1)  which  returns  the  position  in 
the  group;  the  code  for  unlock2  is  just  F&A  (b  +  unlock2,  1). 

6.    Two-way  Lock  with  Busy-waiting 

For  technical  reasons  it  is  convenient  to  use  8  states,  so  a  two-way-lock  will  have  the  fol- 
lowing structure: 

type  two_way_lock  is 
record 

state:  (0..7)  :=  0; 
current:  integer  :  =  0; 
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finished:  integer  :  =  0; 
other:  integer  :  =  0; 
nextg:  integer  :  =  0; 
end; 

In  addition,  when  a  request  is  rejected,  we  need  to  be  able  to  use  a  field  in  the  rv  word  to 
specify  the  state  which  the  PE  must  wait  for;  we  will  use  the  notation  rv. reject  for  this  field. 
This  field  should  be  positioned  so  that  as  rv  is  passed  back  though  the  network,  the  value  of 
rv. reject  will  not  change,  and  rv  will  not  go  negative.  It  is  convenient  to  make  the  rv. reject 
field  4  bits  wide,  so  that  values  of  the  form  (i  + state),  where  i  <  4,  can  be  passed  back.  The 
PE  can  then  use  a  non-zero  test  to  detect  rejection,  and  the  lower  3  bits  to  specify  the  state  to 
wait  for.  (Actually,  the  state  specified  for  will  be  either  an  II  or  an  12,  hut  the  processor  will 
also  have  to  check  for  the  corresponding  ul  or  u2,  since  the  group  may  be  closed  immedi- 
ately). 

The  algorithm  we  use  is  as  follows:  lock  I  requests  are  accepted  when  in  state  11,  returning 
the  position  in  the  group  to  the  requesting  PE.  A  lock2  request  when  in  state  II  closes  the 
group,  changing  the  state  to  ul;  the  response  to  this  request  specifies  a  non-zero  state,  whose 
lower  3  bits  specify  the  state  the  PE  must  wait  for.  When  in  state  ul.  lock  I  and  lock2 
requests  are  rejected,  but  their  number  is  accumulated,  and  the  requesting  PEs  receive  the 
state  they  should  wait  for,  and  their  position  in  the  group.  Also,  in  state  ul,  unlock  1  requests 
are  counted  till  they  equal  the  number  of  accepted  lockl  requests,  at  which  time  the  state 
changes  to  12,  the  rejected  Iock2  requests  can  proceed,  and  the  rejected  lockl  requests  become 
next  in  line.    The  algorithm  is  as  follows  (we  use  the  notation  a  +  =  b  for  a  :=  a  +  b): 

case  op  of 

lockl:  service_lockI; 

unlock  1:  service_unlockI; 

lock2:  service_lock2;  --  analogous  to  lockl 

unlock2:  service_unlock2;  -  analogous  to  unlock  1 

end  case; 

where  service_lockI  is  the  following: 

case  (state  mod  4)  of 

II:  rv  :=  current;   current  +  =  inc; 
ul:  rv  :=  nextg;   nextg  +  =  inc; 

rv. reject  :=  3  +  state;  (11} 

12:  if  current  =  finished  then 

rv:=  0;   current  :=  inc;    finished  :  =  0; 
state  :=  (state  +  2)  mod  8;  {11} 

else 
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rv  :=  other;   other  +  =  inc; 
rv. reject  :  =  state  +  2\  {11} 

state  :=  (state  +   1)  mod  8;  {u2} 

endif; 
u2:  rv  :=  other;    other  +  =  inc; 

rv. reject  :=  state  +  2;  {11} 

end  case; 

and  service_unIockl  is; 

case  (state  mod  4)  of 

11:  rv  :=  finished;    finished  +  =  inc; 
ul:  rv  :=  finished;    finished  +  =  inc; 
if  current  =  finished  then 

current  :  =  other;    finished  :  =  0; 
other  :  =  nextg;   nextg  :  =  0; 
if  other  =  0  then     {12} 

state  :=  (state  +   1)  mod  4; 
else  {u2}  •-       "^■ 

state  :  =  (state  +  2)  mod  4; 
endif; 
endif; 
12,  u2: 
error; 
end  case; 

The  code  for  the  lock2  and  unlock2  operations  is  more  complicated  than  in  the  previous 
version,  since  it  must  be  prepared  for  rejected  requests;  it  can  be  written  as  follows: 

function  lock2  (b:  pointer  to  two_way_lock)  returns  integer  is 

r  :  integer; 

w  :  integer; 
begin 

r  :=  Fc&A(b  +  lock2,  1); 
if  r. state  <  >  0  then  {busy  wait} 

w  :=  (r.state  mod  8)  /  2;  {ignore  l.s.  bit} 

r.state  :=  0; 

while  w  <  >  b. state/2  loop; 
end  if; 
return  r; 
end  lock2; 
procedure  unlock2  (b:  pointer  to  two_way_lock)  is 
begin 

F&A  (b  +  unlock2,  1); 
end  !ock2; 

This  makes  use  of  the  state  field  in  the  returned  value,  with  non-7cro  values  used  to  indicate 
two  of  eight  states  (two  iterations  of  the  4-state  sequence)  which  the  process  must  wait  for. 
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7.  Serialization 

The  non-busy-waiting  implementation  of  the  two-way-lock  given  above  provides  the 
operations  in  a  single  access  to  the  global  memory.  The  waiting  time  will  be  less  than 
t(codel)-l- t(code2)  in  the  best  case,  where  t(codei)  is  the  time  to  execute  the  code  between 
locki  and  unlocki.  However,  there  are  two  possible  problems. 

The  first  of  these  is  that  the  queues  in  the  combining  switches  could  fill  up  (particularly 
those  near  the  memory),  limiting  subsequent  combinings.  In  general,  this  would  suggest  tiie 
use  of  longer  queues,  but  there  is  no  information  available  at  the  moment  to  suggest  how  long 
much  longer,  since  the  traffic  pattern  induced  by  the  delayed  response  to  F&A  operations  is 
dependent  on  algorithmic  details. 

The  second  is  that  the  queues  in  the  A&Lp  could  become  long,  resulting  in  the  effective 
serialization  of  the  responses.  In  the  case  of  the  two-way  lock  this  will  often  not  be  a  problem, 
since  the  lock  is  typically  used  relatively  lightly.  However,  in  the  case  of  the  barrier  synchroni- 
zation, a  seemingly  simpler  problem,  serialization  of  responses  is  more  serious.  A  general  solu- 
tion, discussed  in  more  detail  in  connection  with  barrier  synchronization  below,  may  be  for  the 
A&Lp  to  limit  the  length  of  the  delayed-response  queues.  Responses  would  be  queued  till  the 
queue  reached  a  specific  size,  at  which  point  the  oldest  request  would  be  sent  a  rejection 
response.  Busy-waiting  would  then  be  needed. 

8.  Hardware  Multiprogramming  in  the  PEs 

A  further  improvement  in  performance  is  possible  if  the  PEs  are  doing  hardware  multipro- 
gramming, i.e.  switching  between  processes  while  waiting  for  a  response  from  the  global 
memory.  This  technique  has  been  suggested  as  a  method  for  reducing  the  effect  of  network 
latency  [e.g.  SBK77,  BS81].  In  this  case  a  delayed  response  to  a  synchronization  request  does 
not  necessarily  waste  PE  time. 

9.  Other  Primitives 

We  give  brief  descriptions  of  implementations  of  a  number  of  other  parallel  primitives. 

9.1.    Queue  and  Stack  Operations 

Parallel  queue  operations  are  important  because  their  efficiency  has  an  immediate 
influence  on  the  granularity  of  processes.    The  structure  described  in  [GLR83]  uses  a  variant  of 
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the  usual  circular  array,  with  insertion  and  extraction  pointers  modified  by  F&A  operations, 
and  with  upper  bounds  on  the  number  of  insertions  and  extractions  in  progress.  In  addition,  a 
lock  was  required  on  each  element. 

A  solution  for  this  was  given  in  [H87],  which  reduced  the  number  of  accesses,  but  which 
involved  busy-waiting  in  several  situations. 

A  non-busy-waiting  implementation  of  queue  operations  can  he  obtained  by  use  of  the 
two-way-lock,  as  suggested  above.  However,  it  is  also  clear  that  there  is  some  redundancy  in 
this  solution.  We  can  remove  this  redundancy  by  storing  the  queue  parameters  (queue  size, 
and  insertion  and  extraction  indices)  in  the  A&L  object.  Since  the  lock  operation  in  the  two- 
way-lock  algorithm  returns  the  index  within  the  current  group,  it  can  be  modified  to  add  the 
insertion  (or  extraction)  index  of  the  array  used  for  the  queue  (or  stack).  This  index  would  be 
updated  when  switching  from  inserting  to  extracting,  and  vice  versa. 

Queue  overflow  or  underflow  can  be  detected  by  the  A&Lp,  which  can  deal  with  it  in  a 
number  of  ways:  by  returning  an  illegal  index;  by  not  returning  any  index;  or  by  abandoning 
the  FIFO  processing  of  requests.  The  best  solution  is  perhaps  a  combination  of  the  latter 
two:  on  overflow,  close  the  current  (inserting)  group,  and  queue  the  request  for  the  next  insert- 
ing group;  keep  the  next  extracting  group  open  till  enough  extractions  have  been  accepted  to 
permit  a  response  to  at  least  one  of  the  delayed  insertion  requests. 

The  code  to  insert  an  item  in  a  queue  would  then  reduce  to: 

Queue[F&A(StartInsertAddress,l)  mod  QueueSize]  :=  item: 
F&A(InsertDoneAddress,l); 

with  similar  code  for  extraction.  This  provides  queue  operations  with  no  busy-waiting  and  no 
requirement  for  read-write  interlocks. 

Stack  operations  can  be  implemented  similarly. 

9.2.    Readers/Writers 

The  standard  F&A  implementation  of  readers/writers  uses  PVchunk  operations,  P  and  V 
operations  with  a  large  value  of  increment.  An  alternative  implementation  using  the  two-way 
lock  can  be  used  instead. 

type  rwitem  is    record 

twi:  two_way_lock; 
contents:  item; 
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numwriters:  integer  :  =  0; 
end; 

procedure  read  (a:  rwitem)  returns  item  is 

w:  item; 
begin 

if  lock  1  (a.twl)  =  0  then 
a. numwriters  :  =  0; 

endif; 

w  :=  a. contents; 

unlockl  (a.twl); 
end  read; 
procedure  write  (a:  nvitem;  w:  item)  is 

i:  integer; 
begin 

i  :=  lock2  (a.twl); 

while  a. numwriters  <  >  i  loop; 

a. contents  :  =  w; 

a. numwriters  +  =   1; 

unlock2  (a.twl); 
end  write; 

A  significant  difference  is  that  this  solution  would  only  require  F&A  operations  with  unit 
increments,  since  the  opcode  can  be  used  to  differentiate  a  read  from  a  write. 

9.3.    Group  Lock 

The  group  lock,  proposed  by  Dimitrovsky,  is  an  object  with  three  operations  (glock, 
gsynch,  and  gunlock)  which  has  the  property  that  if  a  number  of  processes  execute: 

glock(g);  A;  gsynch(g);  B;  gunlock(g) 

or 

glock(g);  A;  gunlock(g); 

no  instance  of  A  will  execute  at  the  same  time  as  a  B.    Furthermore,  there  should  be  no  serial- 
ization or  starvation. 

9.3.1.    Implementation 

The  implementation  of  a  group  lock  using  A&U  is  described  in  [H86];  this  is  complicated 
by  code  to  permit  the  PE  to  identify  which  state  to  busy-wait  for.  fhc  implementation  is  con- 
siderably simplified  if  the  A&Lp  can  queue  responses,  since  no  busy-waiting  is  required  at  all. 
We  can  represent  the  state  of  a  group  lock  by  an  object  which  contains  the  state  of  the  lock 
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(open,  closed,  synched),  the  number  of  PEs  still  in  the  current  group,  the  number  of  PEs  in  the 
current  group  which  have  executed  a  synch  operation,  and  a  queue  of  waiting  requests  to 
enter  the  group,  and  the  number  of  PEs  which  have  been  assigned  to  the  next  group. 

For  a  group  lock  at  base  address  b,  a  glock  request  will  be  implemented  as 
F&A(b  +  glock,l),  gsynch  as  F&A(b  + gsynch.l),  and  gunlock  as  F&A(b+ gunlock,!).  The  PE 
does  not  have  to  check  the  response,  since  all  requests  are  accepted;  a  lock  request  is  accepted 
either  into  the  current  group  or  the  next,  but  in  the  latter  case  the  response  is  only  sent  when 
it  is  in  the  group.   Acceptance  into  the  current  group  is  permitted  if  no  synchs  or  unlocks  have 

been  requested.   The  group  lock  object  has  the  following  components: 

type  glocktype  is 
record 

state:    (open,  closed,  synch)  :=  open; 
current:  integer  :  =  0; 
synched:  integer  :  =  0; 
nextg:  queue  :  =  nil; 
end; 

The  algorithm  is  as  follows  (we  use  the  notation  a  +=   bfora  =  a  +  b  and  (x?y:z)  for  if  x 

then  y  else  z): 

case  op  of 
glock: 

case  state  of 

open:  current  +  =  inc;    reply(ok); 

closed,  synch:         enqueue(nextg); 
end  case; 
gsynch: 

synched  +  =  inc; 

state  :=  (synched  =  current  ?  synch:  closed); 
reply(ok); 
gunlock: 

current  -  =  inc; 
if  current  =  0  then 

replyall(  nextg);    nextg  :=  nil; 
current  :  =  0;   synched  :  =  0;  state  :  =  open; 
elsif  synched  =  current  then 

state  :  =  synch; 
end  if; 
end  case; 

Mere  there  are  three  active  states,  open,  closed,  and  synch,  fhc  state  is  open  if  no  gsynch 
requests  have  been  received,  closed  if  insudicient  gsynch  requests  have  been  received,  and 
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synch  if  insufficient  gunlocks.  (Note  that  this  implementation  difTers  from  that  in  [D86|,  in 
that  a  group  is  only  closed  if  a  synch  request  is  received;  this  eliminates  PEs  having  to  wait 
unnecessarily).  This  code  does  not  return  position  within  the  group,  though  this  could  easily 
be  added. 

9.4.  Single-PE  Interlocking  Reads  and  Writes  (Full/Empty  Flags) 

A  common  synchronization  primitive,  often  implemented  in  the  hardware  (e.g.  the  IIEP's 
wait-and-read,  the  Butterfly's/ Monarch's  steal  operation)  associates  a  full/empty  flag  with  a 
memory  word,  and  provides  operations  which  permit  this  to  he  read  only  if  full,  or  written 
only  if  empty. 

If  used  for  communication  between  no  more  than  two  processors,  these  are  simply  pro- 
vided by  F&A  operations  with  dilTerent  opcodes.  The  F&A  object  would  contain  a  data  field 
and  a  full/empty  bit.  A  read  will  he  executed  as  a  F&A(RcadAddress,0),  and  a  write  as 
F&A(WriteAddress,NewValue).  These  will  not  combine,  and  the  F&A  processor  can  hold  up 
the  response  till  the  appropriate  condition  holds. 

9.5.  Lock  and  Unlock 

A  not-very-busy-waiting  lock  and  unlock  can  be  implemented  as  follows.  The  A&Lp 
would  maintain  two  queues,  a  waiting-queue  of  requests  that  arrived  when  the  lock  was 
locked,  and  an  accepted-queue  of  requests  that  were  accepted  but  not  yet  satisfied.  The  lock 
operation  would  be  F&A(LockAddress  +  lock,l),  which  could  be  combined.  If  the  lock  was 
locked,  the  A&Lp  would  put  the  request  on  the  waiting-queue.  If  unlocked,  the  A&Lp  would 
return  a  zero,  and  save  the  number  of  requests;  the  PF  receiving  a  zxro  response  would  have 
the  lock,  and  PEs  receiving  a  non-zero  value  would  send  a  r'&A(l,ockAddress  + retry,  1 ),  which 
would  be  combined  and  queued  (on  the  accepted-queue)  if  the  lock  was  still  locked,  and  pro- 
cessed like  a  lock  request  otherwise  (but  before  the  requests  on  the  waiting-queue).  The  lock 

code  would  be  of  the  form: 

if  F&A(LockAddr+lock,l)  <>  0  then 

while  F&A(Locky\ddr  + retry,  1)  <  >  0  loop  null  endloop; 

Here  retries  are  only  used  in  the  case  of  requests  which  combine. 

The  unlock  operation  would  be  F&A(LockAddress  + unlock, 0)  which  would  permit  the 
A&Lp  to  respond  to  a  delayed  request  on  the  accepted-queue.   If  the  accepted-queue  was 
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empty  and  the  correct  number  of  retries  had  been  accepted,  the  A&Lp  would  process  the 
waiting-queue.  Processing  the  correct  number  of  retries  ensures  fairness,  since  requests  are 
processed  in  FIFO  order. 

Note  that  serialization  is  not  a  problem  here,  since  this  operation  is  inherently  serializing 
(and  should  be  used  only  as  a  last  resort  anyway). 

9.6.  Multiple-PE  Read-when-full  /  Steal 

This  operation  is  like  lock,  but  returns  a  value  to  one  requesting  PE;  this  value  is 
returned  in  another  field.   The  PE  code  could  be  written: 

v:=  F&A(DataAddr4-steal,l); 
if  (v  &  CountMask)  <  >  0  then 
repeat 

v:=  F&A(DataAddr+ retry,!); 

until  (v  &  CountMask)  =  0; 

The  write  operation  resets  the  full/empty  bit,  and  is  just: 
F&A(DataAddr+  writefill,  value); 

The  assumption  is  that  only  one  PE  (that  having  stolen  the  value)  will  be  writing,  so  there  is 
no  problem  with  combining. 

Note  that  the  implementation  of  a  multiple-PE  write-when-empty  operation  is  more 
difficult,  because  the  data  being  written  can  not  in  general  he  combined,  and  there  is  no  simple 
way  to  prevent  it.  If  multiple  writes  are  necessary,  a  separate  lock  will  have  to  be  used. 

9.7.  Barrier  Synchronization 

This  problem  was  chosen  as  a  simple  example  of  how  ,'\&L  can  avoid  busy-waits.  How- 
ever, the  naive  solution  presented  above  may  encounter  problems  if  there  arc  a  large  number 
of  PEs  synchronizing,  since  the  response  queue  effectively  serializes  the  responses  by  the 
A&Lp.  A  better  solution  is  for  the  A&Lp  to  limit  the  size  of  the  delayed-response  queue  by 
responding  negatively  to  an  old  request  when  a  new  one  is  queued.  The  A&Lp  could  return  a 
value  BigNum  (larger  than  the  number  of  PEs)  to  indicate  non-synchronization.  The  PEs 
would  then  submit  a  retry  request,  so  the  code  would  be: 

if  F&A(BarrierAddr4-synch,l)  >  =  BigNum  then 

while  F&A(BarrierAddr  + retry,!)  >   BigNum  loop  null  cndloop; 

The  A&Lp  will  keep  a   count   of  retry  requests   as  well   as   synch   requests,  and  not   start 
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processing   any   synch   requests   while  there   arc   PEs   whose   retry  requests   have  not   been 
accepted  (the  synch  requests  might  be  for  the  next  barrier  synchronization). 

This  version  of  barrier  synchronization  does  busy-wait,  but  the  busy-wait  cycle  is  longer. 
Although  it  does  not  put  a  clear  bound  on  the  amount  of  serialization  caused  by  the  A&Lp 
queue,  its  performance  would  seem  to  be  worthy  of  study. 

10.  Simplification  of  the  combining  F&A  switch 

We  note  that  most  of  the  above  operations  are  implemented  by  a  F&A  operation  with  an 
increment  of  1,  so  it  is  possible  that  a  simpler  implementation  of  the  combining  switch  may  be 
possible  than  in  the  standard  F&A  algorithms.  For  example,  the  standard  F&A  implementa- 
tion of  readers/writers  need  an  increment  of  N  (the  number  of  PCs);  with  A&L  the  write 

request  can  be  communicated  by  using  a  different  address. 

,,      ,     ■ 

11.  Problems 

11.1.  Interrupts  -  .  . 

One  defficiency  of  implementing  operations  without  busy-waiting  is  that  the  PE  may  not 
be  able  to  respond  to  interrupts  while  waiting.  If  the  expected  waiting  time  is  short,  this 
should  create  no  more  problems  than  masking  interrupts  in  single-processor  systems.  Further- 
more, many  operations  might  reasonably  be  required  to  be  indivisible;  for  example,  context 
switching  in  the  middle  of  a  queue  operation  might  block  all  use  of  the  queue,  and  in  the  mid- 
dle of  a  barrier  synchronization  operation  might  cause  other  PEs  to  wait. 

In  some  cases  it  may  be  necessary  to  make  use  of  a  busy-waiting  form  of  the  operation 
to  permit  interrupts.  One  way  of  doing  this  might  be  to  require  that  the  A&Lp  respond 
within  a  certain  specified  time.  This  could  be  implemented  in  a  number  of  ways:  a  PE  could 
send  time-out  signals  in  the  form  of  A&L  operations  to  the  A&Lps;  the  A&Lp  could  maintain 
and  interrogate  a  counter  (corresponding  to  an  approximate  time);  or  the  A&Lp  could  be  pro- 
vided with  a  timer  and  interrupt  mechanism. 

11.2.  Errors 

The  ability  of  the  A&Lp  to  delay  responses  to  F&A  operations  introduces  a  new  class  of 
programming  errors.  These  are  errors  which  result  in  PEs  which  arc  deadlocked  waiting  for 
responses  from  the  A&Lp  which  are  never  sent;  For  example,  if  a  PFt  does  a  start-queue-inscrt 
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but  does  not  follow  it  with  the  corresponding  end-queue-insert  operation,  all  PEs  waiting  for 
access  to  the  queue  will  be  deadlocked.  To  handle  this  problem,  we  require  a  mechanism  to 
detect  it,  and  a  mechanism  to  recover  from  it. 

Detection  can  be  provided  by  an  A&L  object  of  type  'status',  which  will  have  one  opera- 
tion on  it:  F&A(StatusAddr+ Inquire, PE)  will  return  the  status  of  the  specified  PE.  More 
specifically,  it  will  return  the  address  and  operation  that  the  PE  is  expecting  a  response  to,  and 
zero  otherwise.  This  operation  will  not  be  combinable,  for  obvious  reasons,  so  its  use  must  be 
protected  by  a  lock.  There  must  be  at  least  one  status  object  for  every  A&Lp. 

Recovery  is  a  little  trickier,  because  the  code  in  the  PE  has  (presumably)  been  written  to 
assume  that  there  is  no  busy-waiting,  and  that  any  response  will  have  satisfied  the  original 
request.  It  is  straightforward  to  implement  an  A&L  object  to  accept  an  operation  of  the  form 
F&A(CommandAddr+  Release, PE),  which  will  force  the  A&Lp  to  return  an  error  response  to 
a  PE  with  a  delayed  request.  The  problem  is  that  this  response  may  cause  the  PE  to  make 
further,  and  possibly  disasterous  errors. 

There  are  two  obvious  solutions:  to  provide  an  error  response  to  every  A&L  operation, 
which  the  PE  can  test  for  (at  the  expense  of  some  inefficiency,  perhaps);  or  by  providing  a 
hardware  error  signal  which  will  cause  a  trap  in  the  PE. 
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